RHQ stores min, max, and average aggregate metrics. Prior to RHQ 4.9, those computations were performed by the RDBMS. Starting in RHQ 4.9, they are done by the RHQ Server since Cassandra does not provide aggregate functions. The implementation in 4.9 is non-performant and becomes increasingly slow as the amount of metric data increases. This led to the creation of BZ 1009945. The 4.9 algorithm is straightforward and works as follows (a rough code sketch of this serial flow appears after the list):
1. Query metrics_index to determine the set of schedules that have raw data to be aggregated.
2. For each schedule:
   - Fetch its raw data.
   - Compute the 1 hour metrics.
   - Store the 1 hour metrics.
3. Update metrics_index.
4. Repeat steps 1 - 3 for computing 6 hour metrics (if necessary).
5. Repeat steps 1 - 3 for computing 24 hour metrics (if necessary).
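Here is a minimal sketch of that serial flow in Java. The MetricsDao interface and its methods are hypothetical stand-ins for RHQ's Cassandra data access layer (the real class and method names are not shown here); only the shape of the loop matters.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;

// Sketch of the serial (4.9-style) aggregation pass. MetricsDao is a
// hypothetical stand-in for RHQ's Cassandra data access layer.
public class SerialAggregationSketch {

    interface MetricsDao {
        List<Integer> findScheduleIdsWithRawData(long startTime);        // reads metrics_index
        List<Double> findRawData(int scheduleId, long start, long end);  // reads raw metrics
        void storeOneHourMetric(int scheduleId, long time, double min, double max, double avg);
        void updateIndex(long startTime);                                // clears processed index entries
    }

    private final MetricsDao dao;

    public SerialAggregationSketch(MetricsDao dao) {
        this.dao = dao;
    }

    public void aggregateRawData(long startTime, long endTime) {
        // Step 1: query metrics_index for schedules that have raw data to aggregate
        List<Integer> scheduleIds = dao.findScheduleIdsWithRawData(startTime);

        // Step 2: for each schedule, fetch its raw data, then compute and store
        // the 1 hour metrics -- every read and write happens serially
        for (int scheduleId : scheduleIds) {
            List<Double> rawData = dao.findRawData(scheduleId, startTime, endTime);
            if (rawData.isEmpty()) {
                continue;
            }
            DoubleSummaryStatistics stats = rawData.stream()
                .mapToDouble(Double::doubleValue)
                .summaryStatistics();
            dao.storeOneHourMetric(scheduleId, startTime, stats.getMin(), stats.getMax(),
                stats.getAverage());
        }

        // Step 3: update metrics_index
        dao.updateIndex(startTime);

        // Steps 1 - 3 are repeated for the 6 hour and 24 hour rollups when due.
    }
}
```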
These steps, and in particular the reads and writes, were done serially. Since 4.9, substantial changes have been made that take advantage of the asynchronous DataStax driver to execute the aforementioned steps for multiple schedules in parallel. Fast execution times for the overall aggregation run can be maintained even as the amount of metric data continues to increase. Maintaining performance levels can be accomplished simply by growing the Storage Cluster and increasing request throughput (see Request Throttling for more details about request throughput).
Here is a brief summary of how the driver's asynchronous API is exploited to achieve greater throughput. The queries that fetch raw data for a batch of schedules are executed in parallel, and multiple batches are processed concurrently by multiple threads. The computations of the aggregates for a batch then run concurrently, and finally the writes that store the aggregate metrics for the batch are executed in parallel.
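As a rough illustration of the parallel reads, here is a sketch using the driver's asynchronous API. The class name, CQL statement, and table layout are simplified placeholders (with time modeled as a bigint), not RHQ's actual schema or code; the point is that executeAsync returns immediately and Guava futures let the batch be processed once all of its reads complete.

```java
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;

import java.util.ArrayList;
import java.util.List;

// Sketch of fetching raw data for a batch of schedules in parallel.
public class AsyncBatchReadSketch {

    private final Session session;
    private final PreparedStatement findRawData;

    public AsyncBatchReadSketch(Session session) {
        this.session = session;
        // Placeholder schema: schedule_id int, time bigint, value double
        this.findRawData = session.prepare(
            "SELECT time, value FROM raw_metrics WHERE schedule_id = ? AND time >= ? AND time < ?");
    }

    public ListenableFuture<List<ResultSet>> fetchBatch(List<Integer> scheduleIds,
                                                        long startTime, long endTime) {
        List<ResultSetFuture> queryFutures = new ArrayList<ResultSetFuture>(scheduleIds.size());
        for (Integer scheduleId : scheduleIds) {
            BoundStatement statement = findRawData.bind(scheduleId, startTime, endTime);
            // executeAsync returns immediately, so all reads for the batch run in parallel
            queryFutures.add(session.executeAsync(statement));
        }
        // Completes once raw data for every schedule in the batch has arrived;
        // the aggregate computations and writes can then be chained onto it.
        return Futures.allAsList(queryFutures);
    }
}
```

Several such batches can be submitted from a pool of worker threads, which is what allows multiple batches to be in flight concurrently.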
Maintaining fast execution times for aggregation is easy enough, but it comes at a cost: heap space. Hundreds or even thousands of queries to fetch metric data could be executed in parallel, which can easily and quickly fill up the heap and subsequently cause OutOfMemoryErrors. Request Throttling does not address the problem adequately. To illustrate, consider the following example. Suppose the storage client request throughput is set at 10,000 requests/second. To put that in perspective, that is a very modest throughput even for a default deployment, yet it allows for 10,000 concurrent queries per second. The maximum size payload for an hour of raw data is about 30 KB, so in the worst case those queries would require about 292 MB of heap space (10,000 x 30 KB). Within a matter of a few seconds, over 1 GB of heap space could be needed just for the queried raw data. This simply is not sustainable.
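A quick back-of-the-envelope check of those figures (the numbers come from the example above, not from measurements):

```java
// Worst-case heap demand for the example above: 10,000 in-flight queries,
// each returning roughly 30 KB of raw data.
public class HeapEstimate {
    public static void main(String[] args) {
        long queries = 10_000;                 // concurrent queries per second
        long payloadBytes = 30 * 1024;         // ~30 KB of raw data per query
        double mbPerSecond = (queries * payloadBytes) / (1024.0 * 1024.0);
        // Prints roughly 293 MB/s, so sustained reads exceed 1 GB within about 4 seconds.
        System.out.printf("~%.0f MB of queried raw data per second%n", mbPerSecond);
    }
}
```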
Fortunately there is an easy solution: the number of concurrent reads needs to be throttled. But as we have already discussed, the existing throttling with RateLimiter is not the best option here. A Semaphore is a better fit since it limits concurrent access to a resource, as the sketch below illustrates.
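A minimal sketch of Semaphore-based read throttling, with illustrative names rather than RHQ's actual classes: a permit is acquired before each asynchronous read and released when the query completes, so no more reads than the permit count are ever in flight (how the permit count is derived is covered in the settings below).

```java
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;

import java.util.concurrent.Semaphore;

// Sketch: bound the number of concurrent reads so their result sets cannot
// overrun the heap.
public class ThrottledReadSketch {

    private final Session session;
    private final Semaphore permits;

    public ThrottledReadSketch(Session session, int maxConcurrentReads) {
        this.session = session;
        this.permits = new Semaphore(maxConcurrentReads);
    }

    public ResultSetFuture executeThrottled(Statement statement) throws InterruptedException {
        // Blocks until a permit is free instead of piling more queries onto the heap.
        permits.acquire();
        ResultSetFuture future = session.executeAsync(statement);
        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            @Override
            public void onSuccess(ResultSet rows) {
                permits.release();
            }

            @Override
            public void onFailure(Throwable t) {
                permits.release();
            }
        });
        return future;
    }
}
```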
There are four settings that can be adjusted to tune metrics aggregation performance; a small sketch showing how they fit together follows the list.

- Request limit: The overall throughput in terms of requests per second. It can be set via the rhq.storage.request-limit system property and defaults to 30,000.
- Batch size: The number of schedules for which data will be fetched in each batch. It can be configured via the rhq.metrics.aggregation.batch-size system property and defaults to 25. Setting this to a higher value can improve aggregation performance at the expense of greater heap utilization.
- Parallelism: The number of batches that can be processed in parallel. It can be configured via the rhq.metrics.aggregation.parallelism system property and defaults to 5. In terms of implementation, recall that a Semaphore is used to limit the number of concurrent queries; the number of permits that Semaphore has is determined by multiplying batch size and parallelism. A higher value can improve aggregation performance, again at the expense of greater heap utilization.
- Aggregation workers: The size of the thread pool used during aggregation. It can be configured via the rhq.metrics.aggregation.workers system property and defaults to ceiling(5, num_cores).
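For illustration, a small sketch of how these knobs could be read and combined, assuming they arrive as plain JVM system properties. The property names and defaults are the ones documented above; interpreting ceiling(5, num_cores) as the larger of 5 and the core count is an assumption, and the actual RHQ wiring is not shown here.

```java
import java.util.concurrent.Semaphore;

// Illustrative only: reads the documented system properties with their
// documented defaults and derives the read-throttling permit count.
public class AggregationSettingsSketch {

    public static void main(String[] args) {
        int requestLimit = Integer.getInteger("rhq.storage.request-limit", 30000);
        int batchSize = Integer.getInteger("rhq.metrics.aggregation.batch-size", 25);
        int parallelism = Integer.getInteger("rhq.metrics.aggregation.parallelism", 5);
        // "ceiling(5, num_cores)" is read here as the larger of the two values --
        // an assumption for this sketch, not necessarily what RHQ does.
        int workers = Integer.getInteger("rhq.metrics.aggregation.workers",
            Math.max(5, Runtime.getRuntime().availableProcessors()));

        // Semaphore permits = batch size x parallelism (25 x 5 = 125 by default),
        // i.e. at most that many schedules' reads are in flight at once.
        Semaphore readPermits = new Semaphore(batchSize * parallelism);

        System.out.printf("request-limit=%d, batch-size=%d, parallelism=%d, workers=%d, read permits=%d%n",
            requestLimit, batchSize, parallelism, workers, readPermits.availablePermits());
    }
}
```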